dataset paper
A Systematic Review of NeurIPS Dataset Management Practices
Yiwei Wu, Leah Ajmani, Shayne Longpre, Hanlin Li
As new machine learning methods demand larger training datasets, researchers and developers face significant challenges in dataset management. Although ethics reviews, documentation, and checklists have been established, it remains uncertain whether consistent dataset management practices exist across the community. This lack of a comprehensive overview hinders our ability to diagnose and address fundamental tensions and ethical issues related to managing large datasets. We present a systematic review of datasets published at the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. Additionally, a variety of sites are used for dataset hosting, but only a few offer structured metadata and version control. These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
Interview with Jerone Andrews: a framework towards evaluating diversity in datasets
Jerone Andrews, Dora Zhao, Orestis Papakyriakopoulos and Alice Xiang won a best paper award at the International Conference on Machine Learning (ICML) for their position paper Measure Dataset Diversity. We spoke to Jerone about the team's methodology and how they developed a framework for conceptualising, operationalising, and evaluating diversity in machine learning datasets.
Jerone: In our paper, we propose using measurement theory from the social sciences as a framework to improve the collection and evaluation of diverse machine learning datasets. Measurement theory offers a systematic and scientifically grounded approach to developing precise numerical representations of complex and abstract concepts, which makes it well suited to conceptualising, operationalising, and evaluating qualities such as diversity in datasets. This framework can also be applied to other constructs like bias or difficulty.